
feat: NVIDIA Multi-GPU Detection, Topology-Aware Assignment & Parallelism#501

Merged
Lightheartdevs merged 7 commits into Light-Heart-Labs:main from y-coffee-dev:feat/multi-gpu
Mar 25, 2026

Conversation

@y-coffee-dev
Contributor

feat: NVIDIA Multi-GPU Detection, Topology-Aware Assignment & Parallelism

Summary

Adds end-to-end multi-GPU support for NVIDIA systems. The installer now automatically detects multi-GPU topology, assigns GPUs to services based on interconnect quality and VRAM capacity, and configures services for multi-GPU usage, all without manual intervention. A custom assignment TUI is also available for advanced users.

Architecture

Topology Detection (nvidia-topo.sh)

Parses the nvidia-smi topo -m matrix to extract GPU-to-GPU link types and assigns each link a numerical rank (faster interconnects rank higher).
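For illustration, a minimal Python sketch of this kind of matrix parsing. The link labels, rank values, and function name here are assumptions for the example; the actual mapping lives in nvidia-topo.sh.

```python
# Illustrative link-type ranks; the real values are defined in nvidia-topo.sh.
LINK_RANKS = {"NV12": 92, "NV2": 82, "NV1": 80, "PIX": 60, "PXB": 50,
              "PHB": 40, "NODE": 20, "SYS": 5}

def parse_topo_matrix(text):
    """Parse `nvidia-smi topo -m` output into {(gpu_a, gpu_b): rank}."""
    lines = [l for l in text.strip().splitlines() if l.strip()]
    header = lines[0].split()
    # Keep only GPU columns; NIC columns (e.g. NIC0, mlx5_0) are ignored.
    gpu_cols = [i for i, c in enumerate(header) if c.startswith("GPU")]
    ranks = {}
    for line in lines[1:]:
        cells = line.split()
        if not cells[0].startswith("GPU"):
            continue  # skip NIC rows and the trailing legend
        row = int(cells[0][3:])
        for i in gpu_cols:
            if i + 1 >= len(cells):
                continue  # row shorter than header (affinity columns etc.)
            label = cells[i + 1]  # +1 because the first cell is the row name
            col = int(header[i][3:])
            if row != col:  # the diagonal is the "X" self entry
                ranks[(row, col)] = LINK_RANKS.get(label, 0)
    return ranks
```

Given a two-GPU NVLink matrix, this yields a symmetric rank map such as `{(0, 1): 82, (1, 0): 82}`, which downstream phases can reduce to a minimum link rank.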

GPU Assignment Algorithm (assign_gpus.py)

Four-phase pipeline:

  1. Topology Analysis — Parse GPUs and links, build rank matrix
  2. Subset Enumeration — Generate all GPU subsets, sorted by min link rank (desc), size (asc), VRAM (desc). Find the best subset that fits the model; if none fits, greedily span across GPUs
  3. Service Assignment — Allocate remaining GPUs to whisper/comfyui/embeddings based on availability:
    • 0 remaining: colocate all services on llama's last GPU
    • 1 remaining: all auxiliary services share that GPU
    • 2 remaining: whisper gets one, comfyui+embeddings share the other
    • 3+ remaining: dedicated GPUs; extras go back to llama
  4. Parallelism Selection — Based on GPU count and min link rank:
    • NVLink/XGMI (rank >= 80): tensor parallel (<=3 GPUs) or hybrid (>3 GPUs)
    • Same-NUMA PCIe (rank 11-79): pipeline (<=3 GPUs) or hybrid if rank >= 40
    • Cross-NUMA (rank <= 10): pipeline only
    • Heterogeneous VRAM: proportional tensor split weights
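The phase-4 thresholds above can be sketched as follows. This is a minimal restatement for illustration; the function name and return shape are assumptions, not the actual assign_gpus.py interface.

```python
def choose_parallelism(gpu_count, min_link_rank, vram_mb_per_gpu):
    """Sketch of the phase-4 decision table (illustrative, not the real API)."""
    if gpu_count < 2:
        return {"mode": "none"}
    result = {}
    if min_link_rank >= 80:       # NVLink/XGMI: fast enough for tensor parallel
        result["mode"] = "tensor" if gpu_count <= 3 else "hybrid"
    elif min_link_rank >= 11:     # same-NUMA PCIe
        if min_link_rank >= 40 and gpu_count > 3:
            result["mode"] = "hybrid"
        else:
            result["mode"] = "pipeline"
    else:                         # cross-NUMA: pipeline only
        result["mode"] = "pipeline"
    # Heterogeneous VRAM: weight the tensor split proportionally to capacity.
    if len(set(vram_mb_per_gpu)) > 1:
        total = sum(vram_mb_per_gpu)
        result["tensor_split"] = [round(v / total, 3) for v in vram_mb_per_gpu]
    return result
```

For example, two NVLinked GPUs select tensor parallelism, while four GPUs split across NUMA nodes fall back to pipeline mode.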

Compose Layering

When GPU_COUNT > 1, the stack adds:

  • docker-compose.multigpu.yml — llama-server GPU pinning + split mode
  • extensions/services/*/compose.multigpu.yaml — per-service GPU pinning
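Conceptually, the overlay selection amounts to something like this hypothetical sketch. The real resolve-compose-stack.sh discovers overlays on disk rather than assuming them, so a faithful implementation would also check that each file exists.

```python
def compose_files(gpu_count, services, root="."):
    """Assemble the ordered list of compose files for the stack (sketch)."""
    files = [f"{root}/docker-compose.yml"]
    if gpu_count > 1:
        # Multi-GPU overlay for llama-server pinning + split mode.
        files.append(f"{root}/docker-compose.multigpu.yml")
        # Per-service pinning overlays from extensions.
        for svc in services:
            files.append(f"{root}/extensions/services/{svc}/compose.multigpu.yaml")
    return files
```

With `gpu_count=1` only the base file is used, which is how the single-GPU path stays untouched.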

Interactive TUI

Multi-GPU systems get a configuration prompt:

  • [1] Automatic — runs assign_gpus.py with detected topology
  • [2] Custom — manual GPU-to-service assignment

Non-interactive installs default to automatic assignment.

Test coverage

Automated tests

  • tests/test-nvidia-topo.sh — Tests topology matrix parsing against 7 fixture files covering 1-GPU through 8-GPU configurations, NVLink/PCIe/NUMA topologies, and edge cases like NIC rows in the matrix
  • tests/test-assign-gpus.py — Comprehensive pytest suite covering:
    • Single GPU: strategy, service sharing, parallelism mode, model-too-large error
    • 2-GPU PHB: colocated strategy, pipeline parallelism
    • 4-GPU SOC (cross-NUMA): pipeline mode, dedicated strategy
    • 4-GPU SYS + NV pairs: mixed topology handling
    • 5-GPU NV12 + MLX5: NVLink with NIC filtering
    • 8-GPU NV12 full mesh: tensor/hybrid parallelism selection
    • 8-GPU NV1/NV2 partial mesh: degraded NVLink handling
    • VRAM overflow / span subset scenarios
    • Heterogeneous GPU tensor split proportions

Manual hardware testing

Thoroughly tested on several multi-GPU machines with various configurations including (non-exhaustive):

  • 2x NVIDIA RTX 3060
  • 4x NVIDIA RTX 4080
  • 4x NVIDIA RTX 5060 Ti

All tests confirmed correct topology detection, appropriate strategy selection, and proper compose overlay layering.

What changed

New files

  • installers/lib/nvidia-topo.sh — NVIDIA topology detection library: parses the nvidia-smi topo -m matrix into structured JSON with link types, ranks, and labels
  • scripts/assign_gpus.py — GPU assignment algorithm: 4-phase pipeline of topology analysis, subset enumeration, service assignment, and parallelism selection
  • docker-compose.multigpu.yml — Compose overlay for llama-server with NVIDIA_VISIBLE_DEVICES, LLAMA_ARG_SPLIT_MODE, and LLAMA_ARG_TENSOR_SPLIT
  • extensions/services/comfyui/compose.multigpu.yaml — Per-service GPU pinning overlay for ComfyUI
  • extensions/services/whisper/compose.multigpu.yaml — Per-service GPU pinning overlay for Whisper
  • extensions/services/embeddings/compose.multigpu.yaml — Per-service GPU pinning overlay for Embeddings
  • tests/test-nvidia-topo.sh — Shell tests for topology parsing against fixture matrices
  • tests/test-assign-gpus.py — Python tests covering single GPU, 2-GPU colocated, 4-GPU NVLink/SYS, 5-GPU NVLink, and 8-GPU full/partial mesh topologies
  • tests/fixtures/topology_json/*.json (8 files) — JSON topology fixtures: 1-GPU PCIe, 2-GPU PHB, 4-GPU SOC, 4-GPU SYS+NV pairs, 5-GPU NV12+MLX5, 8-GPU NV12 full mesh, 8-GPU NV12+NUMA, 8-GPU NV1/NV2 partial mesh
  • tests/fixtures/topology_matrix/*.txt (7 files) — Raw nvidia-smi topo -m output fixtures for shell-level testing

Modified files

  • installers/phases/01-preflight.sh — Adds jq and python3 to preflight dependency checks (required by topology detection and assignment)
  • installers/phases/02-detection.sh — Integrates detect_nvidia_topo(): populates GPU_TOPOLOGY_JSON, GPU_HAS_NVLINK, GPU_TOTAL_VRAM, LLM_MODEL_SIZE_MB
  • installers/phases/03-features.sh — Major expansion: multi-GPU configuration TUI with automatic and custom assignment modes, parallelism selection, env var extraction
  • installers/phases/04-requirements.sh — Adds the multi-GPU compose overlay to requirements
  • installers/phases/06-directories.sh — Persists GPU_ASSIGNMENT_JSON and per-service GPU UUIDs to .env
  • installers/lib/constants.sh — Adds multi-GPU related constants
  • installers/lib/tier-map.sh — Adds multi-GPU tier mappings
  • installers/lib/compose-select.sh — Includes docker-compose.multigpu.yml when GPU_COUNT > 1
  • scripts/resolve-compose-stack.sh — Accepts a --gpu-count flag; discovers and merges compose.multigpu.yaml from extensions
  • scripts/detect-hardware.sh — Sources nvidia-topo.sh for topology detection
  • scripts/build-capability-profile.sh — Reads the actual gpu.count from the capability profile instead of hardcoding 1
  • .env.schema.json — Adds new env vars: GPU_ASSIGNMENT_JSON_B64, LLAMA_SERVER_GPU_UUIDS, LLAMA_ARG_SPLIT_MODE, LLAMA_ARG_TENSOR_SPLIT, EMBEDDINGS_GPU_UUID, COMFYUI_GPU_UUID, WHISPER_GPU_UUID, N_GPU_LAYERS

Collaborator

@Lightheartdevs Lightheartdevs left a comment


Review: Needs Work

Strong algorithm and good test coverage (561 lines of pytest), but a few issues need resolving before merge:

1. jq promoted from optional to required (breaking)

01-preflight.sh now hard-requires jq. This will fail installs on minimal systems (e.g., fresh Debian/Alpine containers) that previously worked fine. Either:

  • Auto-install jq (like Docker is auto-installed in phase 05), or
  • Keep it optional with graceful degradation when absent

2. No CI checks have run

This branch has zero CI results. Please push a commit or re-trigger CI so we can see if it passes the test matrix.

3. Docker Compose GPU reservation conflict

docker-compose.multigpu.yml sets both NVIDIA_VISIBLE_DEVICES env var AND deploy.resources.reservations.devices without device_ids. The reservation block will reserve ALL GPUs while the env var tries to limit visibility. These two mechanisms conflict — pick one or wire device_ids dynamically.

4. Minor: duplicate comment line

constants.sh has INSTALL_START_EPOCH listed twice in the "Provides" header comment.

What's good

  • The topology detection with nvidia-smi topo -m fallback is well-handled
  • assign_gpus.py algorithm is correct and the O(2^N) subset enumeration is fine for realistic GPU counts
  • Single-GPU path is preserved (gated on GPU_COUNT > 1)
  • Graceful degradation when nvidia-smi is absent

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

@y-coffee-dev
Contributor Author

@Lightheartdevs Thanks for the thorough review! I adjusted the PR.

1. jq auto-install - Good catch. I've added auto-install logic for jq.
2. CI - Pushed an adjustments commit, this should trigger the CI pipeline
3. Docker Compose GPU reservation - This is intentional rather than a conflict. device_ids in the deploy.resources.reservations.devices block can't be set dynamically: Docker Compose variable interpolation only produces scalar strings, and device_ids expects a YAML sequence, so there's no way to inject a list like ['0', '2'] from an environment variable.
The two mechanisms layer: the deploy reservation makes all GPUs available at the Docker level, and the NVIDIA container runtime then uses NVIDIA_VISIBLE_DEVICES to scope which GPUs are actually visible inside the container.
This is a common approach when you need dynamic per-container GPU assignment in Compose.

4. INSTALL_START_EPOCH duplication - Fixed!

I appreciate the detailed feedback!

@Lightheartdevs
Collaborator

Review Update — Rebase Required Before Merge

Hey @y-coffee-dev, great work addressing the previous review items. The code itself is solid and we want to get this merged. However, we found a critical issue that needs attention first.

🚨 Silent merge bug: LLM_MODEL_SIZE_MB will be dropped

Since you branched, we merged #572/#573/#574 which rewrote the model names and URLs in tier-map.sh (Qwen 3 → Qwen 3.5). Your branch adds LLM_MODEL_SIZE_MB to each tier in that same file.

Git reports a clean merge — no conflicts — but the result silently drops all 11 of your LLM_MODEL_SIZE_MB additions. This happens because git sees main's rewrites and your additions as non-overlapping changes within each tier block, and resolves by taking main's version (which has no LLM_MODEL_SIZE_MB).

What breaks: assign_gpus.py gets called with --model-size "" → float("") raises ValueError → multi-GPU assignment fails on every install. Single-GPU installs are fine (early return guard), but the entire multi-GPU feature would be DOA.
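For context, a defensive guard like the following would fail loudly with an actionable message instead of a bare ValueError. This is a sketch of one possible mitigation, not current assign_gpus.py code.

```python
def parse_model_size(raw):
    """Validate the --model-size argument before use (illustrative guard)."""
    if raw is None or raw.strip() == "":
        # float("") raises a bare ValueError; exit with a diagnosable message.
        raise SystemExit(
            "LLM_MODEL_SIZE_MB is empty: tier-map.sh did not populate it, "
            "so multi-GPU assignment cannot size the model"
        )
    return float(raw)
```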

What's needed

  1. Rebase onto current main (commit 5a932e9)
  2. Re-add LLM_MODEL_SIZE_MB to each tier. The new Qwen 3.5 model sizes (update as needed):
CLOUD:      LLM_MODEL_SIZE_MB=0
ARC:        LLM_MODEL_SIZE_MB=5760    # Qwen3.5-9B-Q4_K_M
ARC_LITE:   LLM_MODEL_SIZE_MB=2870    # Qwen3.5-4B-Q4_K_M
NV_ULTRA:   LLM_MODEL_SIZE_MB=48500   # Qwen3-Coder-Next-Q4_K_M (unchanged)
SH_LARGE:   LLM_MODEL_SIZE_MB=48500   # Qwen3-Coder-Next-Q4_K_M (unchanged)
SH_COMPACT: LLM_MODEL_SIZE_MB=18600   # Qwen3-30B-A3B-Q4_K_M (unchanged)
Tier 0:     LLM_MODEL_SIZE_MB=1500    # Qwen3.5-2B-Q4_K_M
Tier 1:     LLM_MODEL_SIZE_MB=5760    # Qwen3.5-9B-Q4_K_M
Tier 2:     LLM_MODEL_SIZE_MB=5760    # Qwen3.5-9B-Q4_K_M
Tier 3:     LLM_MODEL_SIZE_MB=16400   # Qwen3.5-27B-Q4_K_M
Tier 4:     LLM_MODEL_SIZE_MB=18600   # Qwen3-30B-A3B-Q4_K_M (unchanged)

⚠️ Double-check these against the actual GGUF file sizes on HuggingFace — the Qwen 3.5 models are new and some sizes may differ from the Qwen 3 equivalents you had before.

  3. Push — this should also trigger CI, which hasn't run yet on this branch.

Everything else looks good

We did a full merge simulation and traced every touched installer file. The single-GPU path is completely safe — your guards in 02-detection.sh (GPU_COUNT -gt 1) and 03-features.sh (GPU_COUNT -le 1 → return) are clean. The compose layering, hardware detection additions, and .env generation all use safe defaults. No behavioral changes for existing single-GPU installs on any backend.

Two minor suggestions for a follow-up (non-blocking):

  • Add trap "rm -f $TOPOLOGY_FILE" EXIT after the mktemp in 03-features.sh to clean up on early exit
  • Add a # NOTE: keep in sync with assign_gpus.py comment in the custom TUI parallelism logic in 03-features.sh, since it duplicates the threshold logic from the Python script

Looking forward to the rebase — this is a great feature and we want to ship it. 🚀

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

- Enhanced multi-GPU tier assignment based on topology
- Implemented robust GPU topology detection for NVIDIA
- Implemented GPU link ranking from the fastest to the slowest, for optimal strategy selection in the future phases
- Implemented gathering detailed per-GPU information
- Data structures for GPU information storage
- Robust and comprehensive test suite for NVIDIA topology detection
- Multi-GPU strategy selection algorithm
- Careful handling of edge cases and subtle bugs in strategy selection
- Robust test suite for multi-GPU strategy selection algorithm

GPU assignment and parallelization strategy selection algo, clustering GPUs by topology links to find the optimal setup, Multi-GPU configuration TUI, docker compose overlays for multi-gpu setups

Adjust env schema validation

Fixed inconsistencies in gpu count, json escaping issues, etc

fix issue with writing multigpu overlay

fix resolve-compose-stack.sh multi gpu overlay

fix gpu device id

Refactors + less convoluted docker compose setup

N_GPU_LAYERS validation

fix multi-gpu overlay

@Lightheartdevs
Collaborator

Code Review — 2026-03-24

Overall: Well-architected PR. The topology detection, strategy selection, and compose overlay pattern are all clean and consistent with existing conventions. Good test coverage (7 fixtures). Not merging yet due to conflicts and testing requirements, but this is on track.

Merge Conflicts (5 files)

This PR was branched before the Lemonade integration (19 PRs merged 2026-03-24). The following files will conflict:

  • 02-detection.sh — we added cpu/apple case handlers (#596, "fix(installer): CPU backend wrongly overridden to nvidia when capability profile loaded") and the tier assignment block was modified
  • 06-directories.sh — heavily modified for Lemonade .env generation, LiteLLM config, DREAM_MODE
  • constants.sh — VERSION bumped to 2.4.0
  • .env.schema.json — several new keys added (DREAM_MODE, LLM_BACKEND, LLM_API_BASE_PATH, TARGET_API_KEY, OPENAI_API_KEY)
  • resolve-compose-stack.sh — lemonade mode added to compose stack resolution

Action needed: Rebase onto current main and resolve conflicts. We can help with this if needed.

Revision Requests

  1. Verify INTERACTIVE guard on manual GPU assignment. Phase 03 adds interactive prompts for custom GPU assignment. Confirm these are gated by [[ "$INTERACTIVE" == "true" ]] so non-interactive installs (CI, scripted, --yes) don't hang waiting for input.

  2. jq is now a hard dependency. Phase 01 auto-installs jq if missing — this changes jq from optional to required. Acceptable given multi-GPU needs it, but:

    • Add jq to the README prerequisites list
    • Consider gating the auto-install behind GPU_COUNT > 1 so single-GPU users aren't surprised
  3. LLM_MODEL_SIZE_MB — missing newline at EOF in .env.schema.json. The diff shows the trailing } lost its newline. Minor but fails some linters.

  4. compose.multigpu.yml — NVIDIA_VISIBLE_DEVICES default. Currently defaults to all when LLAMA_SERVER_GPU_UUIDS is empty. On a multi-GPU system where assignment runs, this should always be set. But if the assignment fails silently, all is a safe fallback. Consider logging a warning when falling back.

  5. Temp file cleanup. TOPOLOGY_FILE is created via mktemp in Phase 03 but never cleaned up. Add a trap or explicit rm at the end of the phase.

Testing Requirements

  • Needs testing on actual multi-GPU hardware (we don't have any in the current test matrix)
  • The 7 topology fixtures cover detection well, but end-to-end install → compose up → services running on assigned GPUs hasn't been validated
  • Specifically: verify NVIDIA_VISIBLE_DEVICES with UUID list actually constrains the right containers

What's Good

  • nvidia-topo.sh is a clean pure-function library — follows the lib/ conventions perfectly
  • assign_gpus.py strategy engine is well-separated from bash
  • Compose overlay pattern (compose.multigpu.yaml per service) is consistent with existing GPU overlays
  • Test fixtures are thorough (1-GPU PCIe through 8-GPU NVLink full mesh with NUMA)
  • LLM_MODEL_SIZE_MB in tier-map is a useful addition beyond multi-GPU
  • NVLink vs PCIe strategy selection (tensor split vs pipeline) is the right approach

Status: Defer merge until conflicts resolved and multi-GPU hardware testing available. Happy to help with conflict resolution when ready.

@y-coffee-dev
Contributor Author

Hey @Lightheartdevs! Thanks for the thorough review across installer files, really appreciate the care that went into this.

Good catch on the potential silent drop, I rebased and made sure all LLM_MODEL_SIZE_MB values survived.

Revision requests

INTERACTIVE guard: confirmed and added an explicit early-return guard at the top of run_custom() as an extra safety net, on top of the existing call-site gating.

jq: After digging deeper, it turns out jq was already a hard dependency before this PR: scripts/validate-env.sh has been using it to parse .env.schema.json on every install, regardless of GPU count, so single-GPU users were already getting it. Added it to the README prerequisites.

.env.schema.json newline + LLM_MODEL_SIZE_MB: fixed the trailing newline and added LLM_MODEL_SIZE_MB as a proper schema entry.

NVIDIA_VISIBLE_DEVICES fallback warning: added a warn in phase 03 right after GPU assignment is extracted.

Temp file cleanup: added trap "rm -f $TOPOLOGY_FILE" EXIT and added an explicit rm -f "$TOPOLOGY_FILE" at the end of the phase so it cleans up promptly.

Hardware testing

I tested this end-to-end on several multi-GPU machines before, install through compose up, services running on assigned GPUs, and confirmed NVIDIA_VISIBLE_DEVICES with UUID lists actually constrains the right containers. That said, totally understand if you'd rather wait until the CI matrix has multi-GPU coverage before merging.

Thanks again for the detailed feedback!

@Lightheartdevs
Collaborator

Code Review — 2026-03-25

Solid multi-GPU feature. Architecture is clean, test coverage is thorough, single-GPU and AMD paths are unaffected. Ready to merge with minor notes:

Non-blocking observations

  1. jq dependency — Phase 02 now requires jq for topology JSON parsing. README mentions it but Phase 01 preflight doesn't check for it. The installer already auto-installs jq per the README note, so this should be fine in practice.

  2. base64 -w0 is GNU-specific (line 293, Phase 06). macOS base64 doesn't support -w0. Not a practical issue since multi-GPU is gated by GPU_BACKEND == nvidia and GPU_COUNT > 1 — no Mac will hit this path. But if anyone adds AMD multi-GPU later, this needs a portable wrapper.

  3. Custom mode duplicates assign_gpus.py parallelism logic in bash (Phase 03, ~line 240). Comment says "keep in sync." Consider calling assign_gpus.py with a custom topology JSON from the custom assignments instead, to avoid divergence.

  4. LLM_MODEL_SIZE_MB — verify this is always set before Phase 03 runs. If the tier map doesn't populate it, the assignment algorithm will error.

What's good

  • Topology fixture coverage is excellent (8 configs from 1-GPU to 8-GPU NVLink mesh)
  • Graceful degradation on detection failure (warn, skip, continue)
  • Compose overlay pattern is consistent with existing AMD/CPU/Apple overlays
  • Interactive custom mode gives power users full control
  • Single-GPU installs see zero change

Approved. ✅

@Lightheartdevs Lightheartdevs merged commit d92d789 into Light-Heart-Labs:main Mar 25, 2026
12 of 22 checks passed